Frequency transposition is the process of raising or
lowering the frequency content (pitch) of an audio signal.
The  hearing  impaired  community  has  the  greatest  interest  in 
the  applications  of  frequency  transposition.  Though  several 
analog  and  digital  frequency  transposing  hearing  aid  systems 
have  been  built  and  tested,  this  thesis  investigates  a 
possible  digital  processing  alternative.  Pole  shifting,  in 
the  z-domain,  of  an  autoregressive  (all  pole)  model  of 
speech  was  proven  to  be  a  viable  theory  for  changing 
frequency  content.  Since  linear  predictive  coding  (LPC) 
techniques  are  used  to  code,  analyze  and  synthesize  speech, 
with  the  resulting  LPC  coefficients  related  to  the 
coefficients  of  an  equivalent  autoregressive  model,  a  linear 
relationship  between  LPC  coefficients  and  frequency 
transposition  is  explored.  This  theoretical  relationship  is 
first  established  using  a  pure  sine  wave  and  then  is 
extended  into  processing  speech.  The  resulting  speech 
synthesis  experiments  failed  to  substantiate  the  conjectures 
of  this  thesis.  However,  future  research  avenues  are 
suggested  that  may  lead  toward  a  viable  approach  to 
transpose speech.

I.  INTRODUCTION


A.  BACKGROUND 

Adjusting the frequency content or pitch of a signal
is a topic researched within the audio field. The hearing
impaired community has the greatest interest in the
applications of frequency modification or transposition
techniques. This is due to their need for auditory
speech-processing aids.

Auditory  speech-processing  aids  are  divided  into  two 
groups:  those  which  involve  nonradical  processing  of  the 
speech  signal,  with  the  speech  still  intelligible  to  a 
person  with  normal  hearing,  and  those  which  involve  radical 
re-coding of the speech signal [Ref. 1:pp. 547-557].

An example of radical recoding involves such systems as
cochlear implants, where the normal speech signal is
processed into a series of vibrations that the brain
interprets as sound. Individuals who have this type of aid
surgically inserted in their cochlea must learn a
completely different language than a person with normal
hearing. Examples of nonradical processing aids include the
most widely used amplifier aids and the less familiar
frequency lowering devices or frequency transposition
systems.


Most hearing aids amplify sound. Some aids may amplify
or soften certain frequencies, while others transmit sound
from the aid on one ear to the aid on the other ear. Their
primary purpose, in either case, is to amplify everything
they are capable of sensing. In this thesis, however, we
are interested in developing an algorithm that may someday
drive an aid which lowers the frequency content and
preserves the intelligibility of the speech signal.

B.  FREQUENCY  MODIFICATION 

Pickett [Ref. 2:pp. 191-194] categorizes two basic
methods that have been used for frequency lowering:

1.  Frequency  transposition,  where  a  portion  of  the 
signal  is  separated  out  and  resynthesized  in  a  lower 
frequency  band. 

2.  Frequency  division,  where  the  frequency  of  the 
signal  is  reduced  by  a  fixed  ratio. 

All  of  the  methods  involve  signal  distortion.  Signal 
distortion  tends  to  increase  with  greater  frequency  shifts. 
Here  we  are  concerned  primarily  with  the  idea  of  moderate 
frequency  transposition,  where  the  signal  is  shifted  without 
major  distortions  in  the  information  content. 

The earliest known suggestion of frequency lowering was
by Perwitschky (1925). The earliest transposing hearing aid
was built and tested by Johansson (1955). Since then, there
have been several other systems built and tested, but
considering the advances and trends of current technology,
research in the area of frequency transposition of speech
has not been productive.

Frequency transposition systems have utilized analog
techniques such as frequency modulation (shifting an upper
band to a lower band) and frequency division (a slow playback
of a tape recorded signal), and digital techniques such as
sampling distortion (omitting segments of recorded speech)
and doppler (the delaying of the incoming signal). Though
these methods have been developed and extensively tested,
the digital approach presented here may produce
altogether different results.

Pickett confirms that the possibilities for usable
frequency shifting algorithms have not been explored
extensively enough to make recommendations for practice
[Ref. 2:p. 193]. The research needs in this area include
obtaining new information on the potential for digital
re-coding, exploring the principles of transposition, finding
which general cues can be sent in this way, finding the
optimum parameters, and examining what system can be built
that meets our general and specific needs.

C.  A  NEW  TECHNIQUE  FOR  FREQUENCY  TRANSPOSITION 

Recently, Hall [Ref. 3:p. 56] postulated that pole
shifting in the z-domain using an auto-regressive (all pole)
model of speech may be a possible option for frequency
lowering. He used linear predictive coding (LPC) techniques
to process the speech to determine if pole shifting was a
viable option. His experimental results were positive
because he was able to create a change in pitch on the input
speech segment.

This thesis is an extension of Hall's research. It
ventures beyond the frequency domain model and works
directly with the linear predictive time domain model. It
was postulated that a linear relationship exists between
frequency content and the reflection coefficients determined
using LPC. Once this theory had been postulated, a speech
processing experiment was undertaken to determine if the
conjectures made were plausible.

In this report linear prediction is introduced, the
particular algorithms used to process the data are
explained, and the experimental research is described.
Identical  phrases  of  speech,  spoken  at  different  pitch 
levels  by  the  same  speaker,  are  sampled  and  processed. 
Possible  patterns  existing  between  the  different  pitch 
segments  of  speech  and  their  linear  predictive  coefficients 
are  analyzed. 

The results of this research indicate that there is no
linear relationship that exists between the frequency
content of speech and the LPC reflection coefficients, and
recommendations are made for continued analysis concerning
linear predictive coding and the frequency transposition of
speech.


II.  MODELING SPEECH PRODUCTION


A.  INTRODUCTION 

In  order  to  understand  speech  reproduction  and 
synthesis,  it  is  useful  to  consider  some  of  the  basic 
elements  that  combine  to  produce  speech.  The  most 
elementary  model  used  to  explain  the  production  of  speech 
is  the  human  model  illustrated  below  as  Figure  1. 


Human Speech Production System [Ref. 4:p. 42].

Figure  1. 

The  lungs  produce  the  air  flow  necessary  to  begin  the 
generation  of  sound.  The  vocal  cords,  tongue,  mouth,  lips 
and  nasal  tract  combine  their  different  properties  to  shape 
the  airflow  to  produce  the  speech  waveform  we  hear. 



B.  THE  SPEECH  PRODUCTION  MODEL 


Evans [Ref. 4:pp. 40-45] relates the several human
functions  to  mechanical  models.  This  is  standard  practice 
and  a  widely  accepted  approach  to  speech  production 
modeling.  He  states  that  the  lungs  are  the  excitation 
source  for  the  vocal  and  nasal  tract  areas.  An  excitation 
source  can  either  be  modeled  as  a  pulse  train  generator  or  a 
random  number  generator  when  reproducing  speech. 

In the case of voiced sounds (e.g., vowels, nasal sounds,
and voiced consonants), the air released by the lungs is
periodically modulated by vibrations of the vocal cords,
glottis, and velum. Thus the excitation model in this case
is a pulse generator. In the case of unvoiced sounds (e.g.,
sh, sss, fff), which require no vibrations to be produced,
the modeled excitation source is a random number generator.

Both excitation sources produce a quasi-periodic waveform
that we recognize as speech. That is, the period of
the waveform varies with time depending on the sound being
produced. This phenomenon is most obvious in the production
of voiced or vibrated sounds. Figure 2, a general
discrete-time model of the human speech process, illustrates
this point more clearly. Here we have represented the vocal
tract model as a time-varying digital filter.

Note that the pulse train generator has an input labeled
pitch period. This input determines when the pulses will be
emitted from the pulse generator and at what periodicity.
This is only necessary for voiced speech.


The unvoiced speech is a continuous stream of random
numbers commonly referred to as white noise. The flow of
random numbers may produce a seemingly quasi-periodic sound;
however, since they are usually of such short duration, we
consider the sound to be continuous and constant, and not
periodic.

Each speech waveform has a specific amount of energy.
The energy contained within each utterance of a set duration
will be referred to as gain (G). This is what gives speech
its body or quality. It also aids reproduction by
indicating the intensity or inflection of the voice signal.

Once the voiced or unvoiced decision is made and an
energy or gain is assigned, the scaled excitation function
drives the vocal tract model. In a phone interview, James
Kaiser of Bell Laboratories mentioned that
current thinking in the area of speech reproduction has
refocused its attention on this portion of the model and
that there is a movement to more clearly describe the
physics behind the different physical contributors of
speech.

This vocal tract model is driven by the excitation and
energy function and controlled by time-varying vocal tract
parameters. These vocal tract parameters adjust the vocal
tract model to yield the desired output waveform. By
replacing the vocal tract model with an equivalent
time-varying digital filter that models the vocal tract model's
response, we are able to step right into the next phase of
synthetic speech reproduction.
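The excitation stage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_excitation`, the 80-sample frame (10 ms at an assumed 8 kHz sampling rate), and the use of Gaussian noise for the random number generator are choices made for the example, not details taken from the thesis.

```python
import random


def make_excitation(num_samples, voiced, pitch_period=None, gain=1.0, seed=0):
    """Return one frame of scaled excitation samples.

    voiced=True  -> unit pulses every `pitch_period` samples (pulse train)
    voiced=False -> zero-mean, unit-variance white noise (assumed Gaussian)
    """
    if voiced:
        if not pitch_period or pitch_period < 1:
            raise ValueError("voiced frames need a positive pitch_period")
        frame = [1.0 if n % pitch_period == 0 else 0.0
                 for n in range(num_samples)]
    else:
        rng = random.Random(seed)
        frame = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    # The gain G scales the excitation before it drives the vocal tract filter.
    return [gain * x for x in frame]


# Hypothetical 10 ms frames at an assumed 8 kHz sampling rate (80 samples).
voiced_frame = make_excitation(80, voiced=True, pitch_period=20, gain=0.5)
unvoiced_frame = make_excitation(80, voiced=False, gain=0.5)
```

The voiced frame carries a pulse every 20 samples; the unvoiced frame is a noise burst of the same length. Either one would then be passed to the time-varying digital filter.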

C.  DIGITAL  FILTER  REPRESENTATION 

Although speech is modeled most efficiently by poles and
zeros, it may also be modeled accurately by an auto-regressive
(all pole) filter if the order of the filter is
large enough. For example, a tenth order auto-regressive
filter will accurately model most audible sounds.
Therefore, the transfer function H(z) of the digital
filter in Figure 2 is shown as Eq. 2-1.

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{2-1}$$

where p is the order of the filter, G is the gain, and the ak are
the filter coefficients.

G and ak are the time-varying vocal tract parameters for
this filter. For a given segment of time (e.g., 10
milliseconds) the vocal tract parameters are constant. However,
when these segments are strung together in rapid succession to
produce a one second interval of speech, the parameters will
change 100 times. This is why they are referred to as
time-varying; they vary over a short period of time.


The type of digital filter used in Figure 2 is
arbitrary. It is the concept behind the diagram that
counts. For the purposes of this research, the properties
and attributes of a time-varying lattice filter are best
because they lend themselves well to linear predictive
coding implementation.


III.  LINEAR PREDICTION THEORY


A.  WHY  LINEAR  PREDICTION? 

Although spectral analysis is a well-known technique for
studying signals, its application to speech signals suffers
from a number of serious limitations arising from the
nonstationary as well as the quasiperiodic properties of the
speech wave. By modeling the speech wave itself, rather
than its spectrum, we avoid the problems inherent in
frequency-domain methods.

For  instance,  traditional  Fourier  analysis  methods 
require  a  relatively  long  speech  segment  to  provide  adequate 
spectral  resolution.  As  a  result,  rapidly  changing  speech 
events  cannot  be  accurately  followed  [Ref.  5:pp.  276-294]. 

Linear  predictive  coding  is  applicable  to  a  wide  range 
of  research  problems  including  speech  production  and 
perception.  One  of  the  main  objectives  in  any  speech 
processing  technique  is  the  synthesis  of  speech  which  is 
indistinguishable  from  normal  human  speech. 

Atal  noted  that  much  can  be  learned  about  the 
information-carrying  structure  of  speech  by  selectively 
altering  the  properties  of  the  speech  signal.  He  also 
stated  that  LPC  techniques  can  serve  as  a  tool  for  modifying 
the acoustic properties of the speech signal [Ref. 5:p. 276].
These  are  exactly  the  intentions  of  this  thesis:  to  modify 


the  speech  signal  by  investigating  the  properties  of  the 
information  carrying  structure. 

The remainder of this chapter is a summary of linear
prediction theory. The major portion of this section is
extracted from Makhoul's tutorial review on linear
prediction [Ref. 6:pp. 124-143], and will be based on an
intuitive approach, with emphasis on the clarity of ideas
rather than mathematical rigor.

B.  LPC  THEORY 

In applying time series analysis, each continuous signal
s(t) is sampled to obtain a discrete-time signal s(nT), also
known as a time series, where n is an integer variable and T
is the sampling interval. The sampling frequency is then
fs = 1/T. Note that s(nT) will be represented as sn in this
discussion.

The signal sn is considered to be the output of some
system, with some unknown input un such that the following
relation holds:

$$s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G \sum_{l=0}^{q} b_l u_{n-l} \tag{3-1}$$

where ak, bl, and the gain G are the parameters of the
hypothesized system. This equation says that the 'output'
sn is a linear combination of past outputs and present and
past inputs. That is, the signal sn is predictable from
linear combinations of past outputs and inputs. Hence the
name linear prediction.

C.  PARAMETER  ESTIMATION 

In  the  all-pole  model,  we  assume  that  the  signal  sn  is 
given  as  a  linear  combination  of  its  past  values  and  some 
current  input  un  : 


$$s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G u_n \tag{3-2}$$

which yields the following frequency domain transfer
function

$$H(z) = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}} \tag{3-3}$$

Given a particular signal sn, the problem is to determine
the predictor coefficients {ak} and the gain G in some
manner.

1.  Method of Least Squares

Here  we  assume  that  the  input  un  is  totally 
unknown,  which  is  the  case  of  speech  analysis.  Therefore, 
the  signal  sn  can  at  best  be  approximately  predicted  from  a 
linearly weighted summation of past samples. Let the
approximation of sn be ŝn, where

$$\hat{s}_n = -\sum_{k=1}^{p} a_k s_{n-k} \tag{3-4}$$

Then the error between the actual value sn and the predicted
value ŝn is given by

$$e_n = s_n - \hat{s}_n = s_n + \sum_{k=1}^{p} a_k s_{n-k} \tag{3-5}$$
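As a small illustration of the prediction error just defined, the sketch below evaluates the error sample by sample for a given coefficient set. The function name `prediction_error` and the choice to treat samples before the start of the signal as zero are assumptions made for the example.

```python
def prediction_error(signal, a):
    """Return e[n] = s[n] + sum_{k=1..p} a[k-1] * s[n-k].

    Samples before n = 0 are taken to be zero (an assumption of this sketch).
    """
    p = len(a)
    errors = []
    for n, s_n in enumerate(signal):
        acc = s_n
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * signal[n - k]
        errors.append(acc)
    return errors


# A signal obeying s[n] = 0.9 * s[n-1] is predicted exactly by a_1 = -0.9,
# so only the first sample (which has no past) leaves a residual.
signal = [0.9 ** n for n in range(6)]
residual = prediction_error(signal, [-0.9])
```

Only the first sample of `residual` is non-negligible here, which is the sense in which the residual carries the part of the signal that past samples cannot explain.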

The quantity en is also known as the residual. In the
method of least squares the parameters {ak} are obtained as
a result of the minimization of the expected value or mean
of the error squared term, Ep = E(en²), with respect to
each of the parameters. Ep is the minimum mean square
prediction error, averaged over all n, and is represented by

$$E_p = \mathcal{E}(e_n^2) = \sum_{n=-\infty}^{\infty} \left[ s_n + \sum_{k=1}^{p} a_k s_{n-k} \right]^2 \tag{3-6}$$

For  any  definition  of  the  signal  sn,  a  set  of 
equations  with  a  set  of  unknowns  can  be  solved  for  the 
predictor  coefficients  which  minimize  Ep. 

There are two distinct methods for the estimation of
these parameters, namely the autocorrelation method and the
covariance method. Both methods are clearly described by
Makhoul [Ref. 6:pp. 126-127]. Since the autocorrelation
method is the preferred method, only that method will be
summarized here.

a.  Autocorrelation Method


Here we assume that the error Ep is minimized
over an infinite duration. Since

$$R(i) = \sum_{n=-\infty}^{\infty} s_n s_{n+i} \tag{3-7}$$

is the autocorrelation function of the signal sn,
Equation 3-6 reduces to

$$E_p = R(0) + \sum_{k=1}^{p} a_k R(k) \tag{3-8}$$

where R(0) is the total energy of the input signal and the R(k)
are the autocorrelation values that make up the
autocorrelation matrix of the input signal (see Figure 3).

Autocorrelation Matrix
Figure 3.

It is a symmetric Toeplitz matrix (a Toeplitz
matrix is one in which all the elements along each diagonal
are equal). Since the signal sn is known over only a finite
interval, one popular method to control the size of the
Toeplitz matrix is to multiply the signal sn by a window
function wn. This yields a slightly different signal s'n,
which is zero outside the finite interval.

In  any  case,  the  autocorrelation  matrix  is  the 
means  for  solving  several  of  the  linear  predictive 
coefficients  needed  to  analyze  and  synthesize  speech.  The 
following  chapter  discusses,  in  greater  depth,  what  those 
coefficients  are  and  how  they  are  obtained. 
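The autocorrelation method summarized above can be sketched as follows. The Hamming window, the function names, and the use of plain Gaussian elimination on the Toeplitz system are assumptions of this example (the text specifies only "a window function"); the system solved is sum_k a_k R(|i-k|) = -R(i) for i = 1..p, which follows from minimizing the error under the sign convention of Equation 3-5.

```python
import math


def hamming(n):
    """Hamming window of length n (one possible choice of window w_n)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)) for t in range(n)]


def autocorrelation(frame, max_lag):
    """R(i) = sum_n s[n] * s[n+i] over the finite (windowed) frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def solve_normal_equations(r):
    """Solve sum_k a_k R(|i-k|) = -R(i), i = 1..p, by Gaussian elimination."""
    p = len(r) - 1
    # Augmented matrix: symmetric Toeplitz matrix plus right-hand side.
    m = [[r[abs(i - k)] for k in range(p)] + [-r[i + 1]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, p):
            factor = m[row][col] / m[col][col]
            for j in range(col, p + 1):
                m[row][j] -= factor * m[col][j]
    a = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(m[i][j] * a[j] for j in range(i + 1, p))
        a[i] = (m[i][p] - tail) / m[i][i]
    return a


# Hypothetical frame: 160 samples of a sine wave, windowed, second-order model.
frame = [math.sin(2 * math.pi * 0.05 * t) for t in range(160)]
windowed = [w * s for w, s in zip(hamming(len(frame)), frame)]
r = autocorrelation(windowed, 2)
a = solve_normal_equations(r)
```

General elimination is used here only for transparency; the Toeplitz structure is what allows the faster recursive solution discussed in the next chapter.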


IV.  LINEAR PREDICTION OF SPEECH


A.  INTRODUCTION 

As  mentioned  earlier,  there  are  several  ingredients  or 
time-varying  parameters  that  are  needed  to  generate  speech. 
When using linear predictive coding techniques, three
ingredients are essential: gain or energy, pitch period, and
the filter reflection coefficients or spectral envelope
parameters.

Figure 4 illustrates the fact that, depending on the
specified frame length, these ingredients must change every
10 to 20 ms. On a frame-by-frame basis the incoming signal
is processed to obtain the gain, the pitch period and the
reflection coefficients k1, k2, ..., kN.

The  pitch  period  and  the  gain  parameters  are  used  to 
construct  an  excitation  function  for  production  of  either 
voiced  or  unvoiced  speech.  This  driving  or  excitation 
function  is  input  to  a  filter  which  is  configured  by  the 
spectral  envelope  parameters  determined  from  the  analysis. 
The output is one frame of synthetic speech, and by
stringing several frames of speech together, audible sounds
are produced [Ref. 7:pp. 337-345].

Analysis of the speech signal is done by calculating the
LPC model parameters for each 10 ms time frame. This
chapter will discuss these essential parameters.


B.  LPC  ENCODING  PARAMETERS 


1.  Voiced/Unvoiced Decision Making

Some  sounds  require  the  vibrations  induced  by  the 
vocal  cords,  while  others  do  not.  Voiced  sounds  represent 
those  that  require  an  excitation  from  the  vocal  cords  or 
lips. Unvoiced sounds are generated by a steady flow of air
as  in  the  case  of  's'  or  'f'.  A  decision  must  be  made  in 
order  to  properly  excite  the  digital  filter  to  produce  the 
desired  sounds. 

According to Atal [Ref. 5:p. 280] the voiced/unvoiced
decision is based on the ratio of the mean-squared
value of the speech samples to the mean-squared value of the
prediction error samples. This ratio is considerably
smaller for unvoiced speech sounds than for voiced speech
sounds. Typically, the dividing ratio is a factor of 10.

Voiced Decision:    E[sn²] > 10 E[en²]

Unvoiced Decision:  E[sn²] < 10 E[en²]
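The decision rule above reduces to a single comparison. In the sketch below the function name `is_voiced` and the `threshold` parameter are illustrative; the default of 10 is the typical factor quoted in the text.

```python
def is_voiced(speech, residual, threshold=10.0):
    """Voiced if the mean-squared speech value exceeds `threshold` times the
    mean-squared prediction error for the same frame."""
    ms_speech = sum(s * s for s in speech) / len(speech)
    ms_error = sum(e * e for e in residual) / len(residual)
    return ms_speech > threshold * ms_error


# A frame that is predicted well (small residual) is classified as voiced;
# a frame that is predicted poorly (large residual) is classified as unvoiced.
well_predicted = is_voiced([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1])
poorly_predicted = is_voiced([1.0, -1.0, 1.0, -1.0], [0.5, -0.5, 0.5, -0.5])
```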

This  decision  will  determine  whether  to  excite  the 
digital  filter  with  an  impulse  function  or  white  noise,  each 
having  a  particular  gain  or  energy. 

2.  Gain Computation

In explaining the least squares method of linear
prediction we assumed that the input was unknown.
Equation 3-5 can be rewritten as

$$s_n = -\sum_{k=1}^{p} a_k s_{n-k} + e_n \tag{4-1}$$

Comparing Equations 3-2 and 4-1 we see that the only input
signal un that will result in the signal sn as output is
that where G un = en. That is, the input signal is
proportional to the error signal. For any other input the
output will be different from sn. If, however, the energy of
the output signal is required to equal the energy of the
original signal sn, the total energy of the input signal can
still be specified.

Since the filter H(z) is fixed, it is clear from the
above that the total energy in the input signal G un must
equal the total energy in the error signal, which is given
by Ep. Again, Makhoul [Ref. 6:p. 128] is the primary source
for this information and he provides additional mathematical
background in determining the resultant gain equation

$$G^2 = E_p = R(0) + \sum_{k=1}^{p} a_k R(k) \tag{4-2}$$

where G² is the total energy in the input and R(k) is,
again, the autocorrelation function of the signal.

The classification of a sound as voiced or unvoiced
determines the input to the filter H(z). However, whether the
input G un is white noise or a series of impulses,
the gain is calculated from the same equation.
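The gain equation is a direct sum over the autocorrelation values and predictor coefficients. The sketch below assumes the sign convention of Equation 3-5 (error equals the sample plus the weighted past samples); the function name `lpc_gain` and the clamping of small negative round-off values to zero are choices made for the example.

```python
import math


def lpc_gain(r, a):
    """Return G, where G^2 = R(0) + sum_{k=1..p} a_k * R(k).

    r[0..p] are autocorrelation values, a[0..p-1] the predictor coefficients.
    """
    energy = r[0] + sum(a_k * r[k] for k, a_k in enumerate(a, start=1))
    # Guard against tiny negative values caused by floating-point round-off.
    return math.sqrt(max(energy, 0.0))


# Hypothetical values: R = (1, 0.5, 0.25) with a = (-0.5, 0) gives G^2 = 0.75.
gain = lpc_gain([1.0, 0.5, 0.25], [-0.5, 0.0])
```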


3.  Pitch Period


The period of time that elapses between each
excitation pulse is referred to as the pitch period. Atal
[Ref. 5:p. 279] describes two different methods for
determining pitch period. His second method is summarized
here since it is based on the linear predictive
representation of the speech wave.

In this method, except for a sample at the beginning
of each pitch period, every sample of the voiced speech
waveform can be predicted from the past values. The method
of determining pitch period is relatively simple.

Once the prediction error of the speech signal is
determined through linear predictive processing, the largest
or peak values are noted (Figure 5). These points
determine the times that excitation pulses should be
initiated from the excitation source. This simple
peak-picking procedure was found to be effective in determining
pitch period as developed in Reference 7.
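A minimal sketch of such a peak-picking procedure follows. The text says only that the largest values of the prediction error are noted; the relative threshold (`threshold_ratio`) and the minimum spacing between pulses (`min_period`) are assumptions introduced here to make the idea concrete, and both function names are hypothetical.

```python
def pick_pitch_pulses(residual, min_period, threshold_ratio=0.5):
    """Return sample indices of residual peaks taken as pitch-pulse times."""
    peak = max(abs(e) for e in residual)
    if peak == 0.0:
        return []
    threshold = threshold_ratio * peak
    pulses = []
    for n, e in enumerate(residual):
        if abs(e) < threshold:
            continue
        if pulses and n - pulses[-1] < min_period:
            # Two candidates closer than min_period: keep the larger one.
            if abs(e) > abs(residual[pulses[-1]]):
                pulses[-1] = n
            continue
        pulses.append(n)
    return pulses


def mean_pitch_period(pulses):
    """Average spacing, in samples, between successive pulses."""
    if len(pulses) < 2:
        return None
    gaps = [b - a for a, b in zip(pulses, pulses[1:])]
    return sum(gaps) / len(gaps)


# Synthetic residual: a low-level floor with large peaks every 40 samples.
residual = [0.05] * 120
for index in (5, 45, 85):
    residual[index] = 1.0
residual[46] = 0.8  # a secondary peak that should be ignored
pulses = pick_pitch_pulses(residual, min_period=20)
period = mean_pitch_period(pulses)
```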


4.  Reflection Coefficients


Earlier it was mentioned that the reflection
coefficients determined using LPC are directly related to
the polynomial coefficients of an all pole model. This
section will show the relationship between them and
illustrate how the reflection coefficients are determined.

Recall that we are looking for an estimated output
which is the weighted sum of past system outputs (see
Eqns. 3-4 and 3-5). The autoregressive (AR) model in
Figure 6 illustrates this process.


Autoregressive  Model 
Figure  6. 


The goal of LPC is to adjust the ak's to minimize Ep.
Achieving it involves solution of a linear system of
equations, using Levinson's algorithm, and leads to the
lattice structure AR model we are most interested in (see
Figure 7). The mathematical development for this may be
found in Parker [Ref. 9:pp. 110-112].

Lattice Structure Analysis Model
Figure 7.
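For orientation, the sketch below shows the Levinson-Durbin recursion in the form given in Makhoul's tutorial, with the predictor polynomial written as 1 + sum a_k z^-k. It is not a transcription of Parker's development; in particular, the sign of the reflection coefficients returned here follows Makhoul's convention and may be the negative of the K's used in the lattice of Figure 7.

```python
def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.

    r[0..order] are autocorrelation values. Returns (a, reflection, error):
    predictor coefficients a_1..a_p, the reflection coefficient produced at
    each stage, and the final prediction error energy.
    """
    a = []                 # a[j-1] holds a_j for the current model order
    error = r[0]
    reflection = []
    for i in range(1, order + 1):
        if error <= 0.0:
            raise ValueError("prediction error must stay positive")
        acc = r[i] + sum(a[j - 1] * r[i - j] for j in range(1, i))
        k_i = -acc / error
        # Order update: a_j <- a_j + k_i * a_{i-j}, then append a_i = k_i.
        a = [a[j - 1] + k_i * a[i - j - 1] for j in range(1, i)] + [k_i]
        error *= (1.0 - k_i * k_i)
        reflection.append(k_i)
    return a, reflection, error


# Hypothetical autocorrelation values R = (1, 0.5, 0.25), second-order model.
a, reflection, error = levinson_durbin([1.0, 0.5, 0.25], 2)
```

The final `error` equals R(0) + sum a_k R(k), so the same recursion also delivers the quantity needed for the gain computation.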

Lattice structuring requires the determination of
reflection coefficients, hereafter referred to as K. The
K's of an N-th order lattice filter transfer function are
related to the polynomial coefficients of an N-th order AR
filter transfer function through the following matrix
equation:

[Equations (4-3) and (4-4): matrix recursion relating the
order N+1 coefficients to the order N coefficients, written
in terms of Ryy, ryy and Ryy(0). Not recoverable from the
source scan.]

The matrix ryy is the last column of the Ryy
autocorrelation matrix mentioned earlier. The notation has
been slightly altered from Parker's presentation [Ref. 9:p.
112] to be consistent with the preceding chapters of this
development.

Equations 4-3 and 4-4 have been included in this
presentation to show how the polynomial coefficients (ak's)
are related to the reflection coefficients (K's).
However, there is an easier and more direct method towards
determining K's. A brief development is presented here.

Working in the z-domain, we know that the transfer
function of the AR model is

[Equations (4-5) and (4-6): not recoverable from the source
scan.]

where Ã(z) is A(z) in reverse order.

Combining and reforming in matrix form yields

[Equation (4-7): not recoverable from the source scan.]

or more simply

[Equations (4-8) and (4-9): not recoverable from the source
scan.]

Writing Equation 3-5 in the z-domain yields

$$E^{(N)}(z) = A^{(N)}(z)\, S(z) \tag{4-10}$$


Combining 4-10 with 4-8 and 4-9 and returning to the time
domain yields the following error equations:

$$e^{(N+1)}(k) = e^{(N)}(k) - K^{(N+1)}\, \tilde{e}^{(N)}(k-1) \tag{4-11}$$

and

$$\tilde{e}^{(N+1)}(k) = \tilde{e}^{(N)}(k-1) - K^{(N+1)}\, e^{(N)}(k) \tag{4-12}$$

where e^(N+1)(k) is the forward difference error, and ẽ^(N)(k) is
the backward difference error. Equations 4-11 and 4-12
correspond to the lattice implementation in Figure 7. They
have been used to determine the K's of a 12th order model in
the sine wave and speech experiments which follow.
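A lattice analysis filter built from forward and backward error updates can be sketched as below. The sign convention is an assumption of this sketch: each stage is taken to compute f_new(k) = f(k) - K b(k-1) and b_new(k) = b(k-1) - K f(k), which is one common form; the thesis's own equations may differ in sign. The function name `lattice_analysis` is hypothetical.

```python
def lattice_analysis(signal, k):
    """Return the final-stage forward prediction error for each sample.

    k[m] is the reflection coefficient of stage m + 1. Assumed stage update:
        f_new(n) = f(n)     - K * b(n - 1)
        b_new(n) = b(n - 1) - K * f(n)
    """
    delayed_b = [0.0] * len(k)   # b_m(n - 1) at the input of each stage
    out = []
    for s in signal:
        f = s                    # stage 0: both errors equal the input sample
        b = s
        for m, k_m in enumerate(k):
            b_prev = delayed_b[m]
            delayed_b[m] = b     # remember b_m(n) for the next sample
            f_new = f - k_m * b_prev
            b = b_prev - k_m * f
            f = f_new
        out.append(f)
    return out


# Under this convention a signal obeying s[n] = 0.9 * s[n-1] is whitened by a
# single stage with K = 0.9; only the first sample leaves an error.
signal = [0.9 ** n for n in range(6)]
forward_error = lattice_analysis(signal, [0.9])
```

One attraction of this structure is that adding a stage leaves the earlier reflection coefficients unchanged, so the model order can be raised without recomputing them.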

The  order  of  the  filter  is  simply  determined  by 
assigning  N.  For  speech,  anywhere  from  a  6th  to  a  12th 
order  model  has  been  found  to  be  sufficient. 

The reflection coefficients are determined every 10 to
20 milliseconds and when lined up side by side appear to
present a spectral envelope (Figure 8).

Figure 8.


Determining the reflection coefficients, in any case,
is a straightforward calculation, which is an attractive
feature of LPC. It is the pattern these K's may produce in
our experiment that we will be most interested in.

5.  Spectral Analysis

A  convenient  way  to  portray  the  frequency  content  of 
speech  is  through  the  determination  of  formant  frequencies. 
Formant  frequencies  are  the  most  prominent  frequencies 
present  in  a  speech  waveform. 

Formant  frequencies  are  not  required  to  produce  LPC 
synthesized  speech.  In  other  words,  given  the  voiced 
decision,  gain,  pitch  period,  and  the  reflection 
coefficients,  one  has  enough  information  to  reconstruct  the 
speech  wave  form.  However,  the  determination  of  the  formant 
frequencies  aids  us  in  depicting  a  frequency  transposition. 

a.  Formant Frequencies

The complex roots of the denominator polynomial
are the complex formants (bandwidths and frequencies) used
to approximate the speech signal. The coefficients, ak, of
the denominator polynomial are obtained from time-domain
calculations on samples of a short segment of the speech
waveform; namely {sn} = (s1, s2, ..., sN), where N >> p. Here
N is the number of samples, and p is the order of the
polynomial [Ref. 11:pp. 364-366].
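One way to see how the denominator polynomial carries the formants is to evaluate the all-pole spectral envelope on a frequency grid and pick its local maxima. This is a swapped-in technique for illustration: it locates envelope peaks rather than solving for the complex roots directly. The function names, the 256-point grid and the 8 kHz sampling rate are assumptions of the example, and the denominator is taken as 1 + sum a_k z^-k.

```python
import cmath
import math


def lpc_envelope(a, gain, num_points=256):
    """|H(e^{jw})| for H(z) = G / (1 + sum a_k z^-k), w sampled on [0, pi)."""
    mags = []
    for i in range(num_points):
        w = math.pi * i / num_points
        denom = 1.0 + sum(a_k * cmath.exp(-1j * w * k)
                          for k, a_k in enumerate(a, start=1))
        mags.append(gain / abs(denom))
    return mags


def formant_estimates(a, gain, sample_rate, num_points=256):
    """Frequencies (Hz) of the local maxima of the all-pole envelope."""
    mags = lpc_envelope(a, gain, num_points)
    peaks = []
    for i in range(1, num_points - 1):
        if mags[i] > mags[i - 1] and mags[i] >= mags[i + 1]:
            peaks.append(i * sample_rate / (2.0 * num_points))
    return peaks


# Hypothetical single resonance: pole pair of radius 0.95 at 1000 Hz,
# for an assumed sampling rate of 8000 Hz.
radius, theta = 0.95, 2 * math.pi * 1000.0 / 8000.0
a = [-2.0 * radius * math.cos(theta), radius * radius]
formants = formant_estimates(a, gain=1.0, sample_rate=8000.0)
```

The peak frequency approximates the pole angle, and the sharpness of the peak reflects the pole radius, which is how a root's angle and radius map to formant frequency and bandwidth.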

Under the assumption that the waveform
samples, sn, are samples of a random Gaussian process, the
entire speech sample is broken up into segments containing
an equal number of samples (Figure 9).
Each segment is processed using the Fast Fourier Transform
(FFT) and then low pass filtered if desired.

Flow Chart for Obtaining the Spectral Content
of One Complete Utterance

Figure 9.

The output of each segment contains the spectral
content of that segment. The segments are sequenced together
to yield a time-varying frequency content profile of the
entire utterance. The formant frequencies are the most
prevalent, or peak, frequencies found in the speech wave.

C.  SPEECH SYNTHESIS

A  speech  signal  is  synthesized  by  using  the  same 
parameters  determined  with  LPC  analysis.  A  block  diagram  of 
a  speech  synthesizer  was  shown  in  Figure  4.  The  control 
parameters  supplied  to  the  synthesizer  are  the  pitch  period, 
a  binary  voiced  or  unvoiced  parameter,  the  rms  value  of  the 
speech samples or gain, and the predictor or reflection
coefficients.

The  pulse  generator  produces  a  pulse  of  unit  amplitude 
at  the  beginning  of  each  pitch  period.  The  white  noise 
generator  produces  uncorrelated  uniformly  distributed  random 
samples  with  standard  deviation  equal  to  1  at  each  sampling 
instant.  The  selection  between  the  pulse  generator  and  the 
white  noise  generator  is  made  by  the  voiced-unvoiced  switch. 
The  synthesizer  control  parameters  are  reset  to  their  new 
values  at  the  beginning  of  every  pitch  period  for  voiced 
speech  and  once  every  10  msec  for  unvoiced  speech. 

The amplitude of the excitation signal is adjusted by
the amplifier G. The linearly predicted value of the
speech signal is combined with the excitation signal u_n to
form the n-th sample of the synthesized speech signal. The
signal is finally low-pass filtered to provide the
continuous speech wave s(t). Atal [Ref. 5:p. 280] provides
the mathematical development needed to synthesize speech from
these parameters. A mathematical discussion will not be pursued
further here.
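
As an illustration of the synthesizer just described, the following
Python sketch drives an all-pole recursion with either a pulse train
or white noise. It is a minimal sketch under stated assumptions, not
the program used in this research: the function name `synthesize`, the
sign convention for the predictor coefficients, and the example values
are hypothetical, and the final low-pass filter is omitted.

```python
import random


def synthesize(coeffs, gain, n_samples, pitch_period=None, seed=0):
    """All-pole synthesis: s[n] = sum_k coeffs[k] * s[n-1-k] + gain * u[n].

    pitch_period: samples between unit pulses (voiced speech);
                  None selects the white noise source (unvoiced speech).
    The coefficient convention is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    p = len(coeffs)
    out = []
    for n in range(n_samples):
        if pitch_period is not None:
            # Unit pulse at the start of every pitch period.
            u = 1.0 if n % pitch_period == 0 else 0.0
        else:
            # Uniform on [-sqrt(3), sqrt(3)] has unit standard deviation.
            u = rng.uniform(-3 ** 0.5, 3 ** 0.5)
        # Linear prediction from the samples already synthesized.
        predicted = sum(coeffs[k] * out[n - 1 - k] for k in range(min(p, n)))
        out.append(predicted + gain * u)
    return out


voiced = synthesize([0.5], gain=2.0, n_samples=5, pitch_period=4)
```

With these illustrative values the output decays after the first pulse
and is re-excited when the second pitch period begins.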


V.  DIGITAL FREQUENCY TRANSPOSITION


A.  INTRODUCTION 

The object of this research was to determine an
algorithm that will digitally transpose speech using linear
predictive coding. In this chapter, Hall's research [Ref.
3] will be briefly discussed and summarized. A new theory
will then be postulated and a simple experiment using pure
sine waves will be presented to test the credibility of the
theory. Keep in mind that the real test will be the actual
processing of speech; this section simply sets the scene for
further study.

B.  POLE  SHIFTING  IN  THE  Z-PLANE 

Only the highlights and summary of Hall's thesis will be
presented here. His goal was to change the pole locations
before reconstruction (of the sampled speech signal) to
produce the output voice with different pitch and formant
frequencies while retaining a natural sound and the same
information [Ref. 3:p. 47].

The  autoregressive  vocal  tract  transfer  function  used  in 
his  research  is  represented  by  Equation  5-1. 


$$H(z) = \frac{1}{1 - 2e^{-2\pi(BW)T_s}\cos(2\pi F T_s)\,z^{-1} + e^{-4\pi(BW)T_s}\,z^{-2}} \tag{5-1}$$

where  F  is  the  center  frequency  of  the  formant,  and  BW  is 
the  bandwidth  of  the  formant.  The  pole  locations  associated 
with  this  transfer  function  are: 

$$z = A e^{\pm j\theta}$$

Converting  Equation  5-1  into  polar  form  produces  Equation 
5-2. 


$$H(z) = \frac{1}{1 - 2A\cos\theta\,z^{-1} + A^{2}z^{-2}} \tag{5-2}$$

Through several mathematical manipulations and solving
for A and θ, the following relationships for F and BW are
determined:

$$F = \frac{\theta}{2\pi T} \tag{5-3}$$

$$BW = \frac{-\ln A}{2\pi T} \tag{5-4}$$

where $A = e^{-2\pi(BW)T}$ and $\theta = 2\pi F T$.

Assuming that a linear relationship exists between F
(the original frequency) and F' (the modified frequency),
several general expressions are stated to illustrate the
underlying modification to the pole locations. Note that
the following equations are all linear relationships.

$$F' = \beta F \tag{5-5}$$

$$BW' = \alpha\,BW \tag{5-6}$$

$$\theta' = \beta\,\theta \tag{5-7}$$

$$A' = A^{\alpha} \tag{5-8}$$

An important consideration in producing these
relationships is guaranteeing that no unstable poles will be
created by shifting them outside the unit circle. For more
of the specifics on Hall's development, see Reference 3,
pages 49 and 50.
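
The relationships between a pole location and its formant frequency
and bandwidth can be exercised numerically. The Python sketch below
converts a pole to F and BW using Equations 5-3 and 5-4, scales both,
and converts back. The function names and the stability check are
assumptions made for illustration and are not taken from Hall's
programs; the sampling interval and formant values in the usage lines
are hypothetical.

```python
import cmath
import math


def pole_to_formant(pole, T):
    """Return (F, BW) in Hz for a z-plane pole, per Eqns 5-3 and 5-4."""
    A, theta = abs(pole), abs(cmath.phase(pole))
    return theta / (2 * math.pi * T), -math.log(A) / (2 * math.pi * T)


def formant_to_pole(F, BW, T):
    """Return the upper-half-plane pole A * exp(j * theta) for a formant."""
    A = math.exp(-2 * math.pi * BW * T)
    theta = 2 * math.pi * F * T
    return cmath.rect(A, theta)


def shift_pole(pole, T, freq_scale, bw_scale=1.0):
    """Scale the formant frequency and bandwidth of one pole."""
    F, BW = pole_to_formant(pole, T)
    new_F, new_BW = freq_scale * F, bw_scale * BW
    # BW > 0 keeps |z| < 1 (stable); F below 1/(2T) avoids aliasing.
    if not (0.0 <= new_F < 0.5 / T) or new_BW <= 0.0:
        raise ValueError("shifted pole would be unstable or aliased")
    return formant_to_pole(new_F, new_BW, T)


T = 1.0e-4                                   # hypothetical 10 kHz sampling
original = formant_to_pole(1000.0, 100.0, T)
lowered = shift_pole(original, T, freq_scale=0.88)
```

Because the bandwidth is held fixed here, the pole radius is unchanged
and only the pole angle moves, so the shifted pole stays inside the
unit circle.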

Two experiments are illustrated in Hall's thesis. They
are:

1.  Pitch was reduced by a factor of .58 and the
formant frequencies reduced by .88 for voiced
speech.

2.  The same modification was done for a segment of
unvoiced speech.

Hall  concluded  that  upon  completion  of  the  process  most 
listeners  agreed  that,  although  the  input  speech  was  female, 
the  modified  output  speech  sounded  typically  male.  It  was 
also  noted  that  although  the  audio  output  was  somewhat 
lacking in quality, it was intelligible [Ref. 3:p. 73]. The
tapes  which  recorded  that  audio  output  are  no  longer 
available  for  subjective  evaluation. 


Linear predictive coding is a means to an end for Hall.
He modifies the variables mentioned (F, BW, θ, A), and
processes the speech with LPC computer programs. This
conversion between an autoregressive vocal tract model and an
LPC model (implemented most easily by a lattice filter
configuration) is possible through Equations (4-3) and
(4-4).

The mathematics are simple. What is most important here
is that the two different representations of speech, the AR
model and the LPC model, are closely associated with one
another. To calculate one, in a sense, is to calculate the
other.
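
The close association between the two representations can be
illustrated with the standard step-up and step-down recursions. The
conversion equations of Chapter 4 are not reproduced in this chapter,
so the Python sketch below should be read as one common formulation
only. It assumes the predictor convention in which the n-th sample is
predicted as the sum of a_i times the sample i steps earlier; the
function names are hypothetical.

```python
def step_up(ks):
    """Reflection coefficients k[1..p] -> predictor coefficients a[1..p]."""
    a = []
    for m, k in enumerate(ks, start=1):
        # a_i(m) = a_i(m-1) - k_m * a_{m-i}(m-1), and a_m(m) = k_m
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
    return a


def step_down(a):
    """Predictor coefficients a[1..p] -> reflection coefficients k[1..p]."""
    a = list(a)
    ks = []
    for m in range(len(a), 0, -1):
        k = a[m - 1]                      # k_m is the last coefficient
        if abs(k) >= 1.0:
            raise ValueError("unstable filter: |k| >= 1")
        ks.append(k)
        # Undo one stage of the step-up recursion.
        a = [(a[j] + k * a[m - 2 - j]) / (1.0 - k * k) for j in range(m - 1)]
    return ks[::-1]


predictor = step_up([0.9, -0.5, 0.2])
recovered = step_down(predictor)
```

For a single second-order section with pole radius A and angle θ, this
convention gives a_1 = 2A cos θ and a_2 = -A², so that K2 = -A² and
K1 = 2A cos θ / (1 + A²).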

C.  A  NEW  PROPOSITION 

1.  Statement of Theory

As  mentioned  earlier,  LPC  techniques  can  serve  as  a 
tool  for  modifying  the  acoustic  properties  of  the  speech 
signal.  This  thesis  postulates  that  a  linear  relationship 
exists  between  the  reflection  coefficients,  which  determine 
the  spectral  envelope  of  the  speech  wave  form,  and  the 
frequency  content  of  that  wave  form.  If  this  relationship 
exists  and  the  linear  relationship  is  determined,  then  by 
selectively  modifying  the  reflection  coefficients,  the 
frequency  content  will  also  be  modified. 

Is there a linear relationship between the
reflection coefficients (K's) and frequency content? The
first step in our proof is to analyze the simplest
case. Since speech is often represented as a combination of
many different frequencies, the simplest case is a
fixed-frequency sine wave. If the results turn
out to be negative, then exploring the more complex case
(speech) would probably be futile.

2.  Sine Wave Experiment

At any given frequency a pure sine wave may be
considered a continuous energy and amplitude signal which
will generate an audible pitch when it is within the 200 Hz
to 15 kHz audible range. When dealing with normal speech
waveforms, the audible pitch range is somewhere between
200 Hz and 5 kHz.

A computer program was written in Fortran [App. A]
for use on the IBM 3033 to produce a sine wave for further
analysis. The resultant sine wave could be sampled at any
desired rate and the frequency of the wave could be
incremented to cover the range of 200 Hz to
5 kHz.

Once the sampling rate was determined and the sine
wave frequency set, the reflection coefficients were
calculated for a 10 msec time frame, stored in a holding file,
and plotted to determine if a relationship exists between
frequency and the nth order K's.

To determine 12 reflection coefficients (K's) for
each frequency, Equations 4-11 and 4-12 were used.
Additional runs were also made to determine the effect of
noise on the outcome. The results were promising.
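
The sine wave experiment can be approximated with the short Python
sketch below. It is not the Fortran program of Appendix A: it assumes
the autocorrelation method with the Levinson-Durbin recursion, which
may differ in detail from Equations 4-11 and 4-12, and the names
`sine_frame` and `reflection_coefficients` are hypothetical. The
10 kHz sampling rate is assumed for illustration; the 10 msec frame
and the 12th order follow the text.

```python
import math


def sine_frame(freq_hz, fs_hz=10000.0, duration_s=0.010):
    """Samples of a unit-amplitude sine wave covering one analysis frame."""
    n = int(round(fs_hz * duration_s))
    return [math.sin(2 * math.pi * freq_hz * i / fs_hz) for i in range(n)]


def reflection_coefficients(frame, order=12):
    """Levinson-Durbin on the frame autocorrelation; returns k[1..order]."""
    n = len(frame)
    r = [sum(frame[i] * frame[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a, err, ks = [], r[0], []
    for m in range(1, order + 1):
        if err <= 0.0:                 # silent or perfectly predicted frame
            ks.append(0.0)
            a = a + [0.0]
            continue
        k = (r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))) / err
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        ks.append(k)
        err *= 1.0 - k * k             # remaining prediction error
    return ks


k_500 = reflection_coefficients(sine_frame(500.0))
k_1000 = reflection_coefficients(sine_frame(1000.0))
```

Under these assumptions, and for a frame holding a whole number of
cycles, the first coefficient equals the cosine of the normalized
frequency, which is a smooth but not strictly linear function of
frequency.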

3.  Sine Wave Experimental Results

Appendixes B and C illustrate the apparent linear
relationship that exists between frequency and the LPC nth
order K's in a noiseless environment. Appendixes D and E
illustrate that same relationship in a noisy environment
(S:N = 10:1).

It would appear that a linear relationship does
exist between the frequency of a sine wave and its
reflection coefficients. Noise, on the other hand, changes
that linear relationship. Noise addition seems to affect K7
through K12 much more than K1 through K6.

Considering  the  mathematics  involved  in  calculating 
K,  these  observations  are  reasonable.  Since  the  later  K's 
are  affected  most  by  small  changes  in  the  input  signal, 
addition  of  noise  will  affect  them  more  drastically  than  the 
earlier stages.

Though  these  observations  are  promising,  they  are  by 
no  means  conclusive.  If  no  correlation  between  the  K's  and 
frequency  existed,  another  scheme  would  have  had  to  be 
considered.  Nevertheless,  speech  is  the  more  complicated 
signal  that  we  consider  in  the  next  two  sections. 
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SPEECH  PROCESSING  EXPERIMENT 


A.  INTRODUCTION 

Now  that  the  fundamentals  of  linear  predictive  coding 
have  been  presented  and  a  theory  of  frequency  transposition 
proposed,  it  is  necessary  to  work  directly  with  speech 
itself. To obtain the information we are seeking, the
correlation between reflection coefficients and frequency
content, speech samples must be examined.

Documentation  concerning  the  data  acquisition  system 
used  in  this  research  to  obtain  speech  samples  is  provided 
as  Appendix  F.  This  chapter  discusses  the  data  itself,  and 
the  processing  of  it. 

B.  VOICED/UNVOICED  PHRASES 

Three  phrases  were  chosen  for  their  voiced  and  unvoiced 
characteristics  as  described  in  Chapters  2  and  4.  They  are: 

1)  "READY"

2)  "SO WHAT"

3)  "SNEEZE"

Each phrase was repeated at a different pitch and, to
keep things simple, the musical scale was used to tie each
change in pitch to a reference. In other words, "READY" was
first spoken in the middle-C range, and then in the D range,
and so on until it was finally spoken in the high-C range.

This  procedure  yielded  eight  different  pitches  for  each 
of  the  three  phrases.  One  male  speaker  provided  the  data 
for all three phrases. Additionally, the period of each
utterance was held roughly constant from pitch to pitch.
For  a  graphical  representation  of  the  selected  speech 
utterances,  refer  to  Appendices  G,  H,  and  I. 

Each  phrase  was  chosen  for  content  and  can  be  classified 
as  voiced,  unvoiced,  or  a  combination  of  both.  "READY"  is 
strictly  a  voiced  word,  whereas  "SO  WHAT"  and  "SNEEZE"  are  a 
combination of voiced and unvoiced segments. The S, WH, and
T sounds in "SO WHAT" will be our unvoiced example, and
"SNEEZE" will be the combined example as the data is
analyzed.

C.  DATA  PROCESSING 

This  section  discusses  the  techniques  utilized  to 
analyze  the  data  and  the  observations  made. 

1.  Speech Data

The raw speech data was edited and displayed using a
generic display program. The data is 8 bit information with
a maximum range of 256 equally spaced values. The
resolution of each utterance varied with the pitch. The
lower frequencies tended to have less gain or energy and
therefore did not use all the 256 range values available. A
summary of the ranges is provided in Appendix J.

The periods of the phrases were different. The
period of the same utterance at different pitches
varied by as much as 20 msec. A short summary of the average
periods is given in Table 1.


TABLE 1.

UTTERANCE     PERIOD     NO. SEGMENTS     NO. DATA PTS./SEG
              (sec)      N                (10 msec SEG)

"READY"       .30        30               100
"SO WHAT"     .40        40               100
"SNEEZE"      .38        38               100

The  sampling  rate  was  10  kHz  for  all  of  the 
utterances,  so  the  number  of  data  points  in  each  10  msec 
segment  is  100. 
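
The segment arithmetic above can be checked with a few lines of
Python. This is only an illustrative sketch: the function name
`split_into_segments` is hypothetical, and discarding a trailing
partial segment is an assumption, not a documented step of the
original processing.

```python
def split_into_segments(samples, fs_hz=10000, segment_s=0.010):
    """Split a sample list into consecutive fixed-length segments.

    Any trailing partial segment is dropped (an assumption of this sketch).
    """
    seg_len = int(round(fs_hz * segment_s))      # 100 points at 10 kHz
    n_segments = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


# A 0.38 sec utterance sampled at 10 kHz holds 3800 points.
segments = split_into_segments(list(range(3800)))
```

Here the 3800 points of a 0.38 sec utterance divide into 38 segments
of 100 points each, in agreement with Table 1.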

2.  Determining Reflection Coefficients

Once the starting point is determined for each
utterance, the reflection coefficients are calculated for 10
msec segments of speech [App. K]. Successive segments are
analyzed to yield their respective reflection coefficients
using Equations 4-11 and 4-12, as in the sine wave
calculations.

Reflection  coefficients  K1  through  K6  were  plotted 
for  each  of  the  24  utterances  and  several  of  the  resultant 
curves  are  included  as  Appendix  L. 

a.  Trend Analysis

A graphical trend analysis of the plotted data
was undertaken to detect any obvious patterns. The details
of that analysis are included as Appendix M. In summary,
no significant trends were noted as a function of pitch.

b.  Graphical Correlation

One graph was held stationary as a reference and
the others were passed over it to see if there were any
obvious matches. There is nothing more elaborate to
report than that no correlation was noted between them.
Even though at times there were 2 or 3 points which matched
up, the other 28, 36, or 38 points did not. Also there
seemed to be no distinction between voiced and unvoiced
portions of the speech wave. This process leads to the
conclusion that the various speech segments are highly
uncorrelated.
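
The overlay comparison described above was visual. A numerical
counterpart, which was not part of this research, would be the
correlation coefficient between two reflection coefficient
sequences. The Python sketch below is offered only as an illustration;
the function name is hypothetical, and truncating the longer sequence
to the length of the shorter one is an assumption.

```python
import math


def correlation(x, y):
    """Pearson correlation of two sequences, truncated to equal length."""
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0                      # a flat curve carries no trend
    return sxy / math.sqrt(sxx * syy)


# Hypothetical K trajectories, one value per 10 msec segment.
k_ref = [0.1, 0.4, 0.2, 0.9]
k_same_shape = [2.0 * v + 1.0 for v in k_ref]
k_inverted = [-v for v in k_ref]
```

A value near +1 or -1 indicates curves of matching or mirrored shape,
while a value near zero corresponds to the absence of any match noted
in the overlays.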


3.  Spectral Analysis of Reflection Coefficient Patterns

It was noted during the trend analysis that the
temporal patterns presented by the reflection coefficients
seemed periodic. At first it was believed that this could
possibly reflect the pseudo-periodic nature of speech or the
excitation source.

Spectral  analysis  was  implemented  using  a  Fortran 
subroutine  to  compute  the  FFT  of  each  pattern.  The  program 
is  included  as  Appendix  N  and  several  examples  of  the 
results are provided as Appendix O.

In  summary  all  of  the  spectra  turned  out  to  be 
relatively  flat.  This  indicates  that  there  are  no  prominent 
frequencies  within  the  reflection  coefficient  sequences. 
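
The flatness test can be illustrated with a direct discrete Fourier
transform of a coefficient sequence. The Python sketch below is not
the Fortran FFT subroutine of Appendix N; a plain DFT is substituted
for brevity, and the function names and the peak-to-mean measure are
assumptions introduced here.

```python
import cmath
import math


def magnitude_spectrum(seq):
    """DFT magnitudes (bins 0..n/2) of a sequence with its mean removed."""
    n = len(seq)
    mean = sum(seq) / n
    centered = [v - mean for v in seq]
    return [abs(sum(centered[t] * cmath.exp(-2j * math.pi * f * t / n)
                    for t in range(n)))
            for f in range(n // 2 + 1)]


def peak_to_mean(spectrum):
    """Ratio of the largest non-DC bin to the average non-DC bin."""
    body = spectrum[1:]
    mean = sum(body) / len(body)
    return max(body) / mean if mean > 0.0 else 0.0


# A hypothetical periodic pattern shows one dominant bin.
periodic = [math.cos(2 * math.pi * 4 * t / 32) for t in range(32)]
periodic_spectrum = magnitude_spectrum(periodic)
```

A ratio close to one indicates a flat spectrum with no prominent
frequency, while a strongly periodic pattern yields a ratio well above
one.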

4.  Spectral Analysis for Frequency Content

Spectral  analysis  to  determine  the  frequency  content 
of  each  utterance,  as  described  in  Chapter  4,  would  have 
been  useful  had  a  pattern  or  linear  relationship  shown  up  in 
the  observations  mentioned. 

Since there are no patterns or correlations worth
mentioning, exploring the specific frequency content of each
utterance would not benefit us. The relative difference
between each frequency, or Δf, is approximately 32 Hz.

The range of the utterances was chosen to coincide
with the musical scale from middle-C to high-C (a 256 Hz
difference). Had a relationship been discovered, as
proposed, then a more in-depth spectral analysis of the
input speech would have been in order.

D.  SUMMARY  OF  EXPERIMENTAL  RESULTS 

1.  Correlation Between Phrases With Different Pitches

The  linear  relationship  postulated  in  Chapter  5 
should  have  yielded  more  obvious  results  if  relationships 
did  exist  between  identical  phrases  spoken  at  different 
pitches.  Three  of  the  four  categories  mentioned  above 
yielded  negative  or  uncorrelated  results. 

2.  Voiced/Unvoiced Observations

Though there may be other or more sophisticated
techniques available to analyze this data, the methods
mentioned above were sufficient to show that a voiced phrase
was no more correlated than an unvoiced phrase.

The consistently negative or uncorrelated results
lead us to some conclusions about the actual
relationship between frequency content and reflection
coefficients.
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VII.  CONCLUSIONS  AND  RECOMMENDATIONS 


A.  CONCLUSIONS 

A  new  theory  to  transpose  frequency  was  postulated  and 
tested.  Initial  results,  using  sine  waves,  seemed  positive 
and led to a further study using speech waveforms. The
preceding  experiment  and  subsequent  analysis  of  speech 
showed  no  apparent  correlation  between  pitch  and  reflection 
coefficient  values.  These  results  may  be  attributed  to  the 
following  reasons. 

1.  Complexity of Speech

The  speech  wave  form  is  a  very  complex  combination 
of  gain,  excitation,  and  spectral  content.  To  pick  out  one 
particular  attribute  and  analyze  it  for  a  particular 
phenomenon,  such  as  frequency  content,  may  be  unrealistic. 

Speech  has  historically  been  modeled  as  a 
combination  of  sine  waves.  However,  slow  progress  in  the 
field  of  speech  processing  has  caused  engineers  to  rethink 
this  point  in  terms  of  the  physics  involved  in  generating 
speech. This leads to our next conclusion.

2.  Physical/Mathematical Relationship

The experimental results indicate that, in this
case, there is no obvious relationship between the physics
(pitch) of speech and the LPC mathematical representation of
speech (reflection coefficients).

This observation makes sense. Since reflection
coefficient determination is based on probabilistic methods,
error feedback, and random input samples, the resultant
output of each lattice stage no longer resembles the
original signal. Once the error signal passes through the
first stage of the lattice network, its characteristics have
been altered as much as 10 percent. Reflection coefficients
are therefore a tool for determining predicted error
calculations based on past inputs, and not a physical
interpretation of the signal's content.

Just  as  engineers  are  in  error  when  they  refer  to  the 
pattern  that  successive  reflection  coefficients  present  as 
its  spectral  envelope,  reflection  coefficients  do  not 
directly  reflect  the  frequency  content  of  the  signal. 

3.  Periodic/Pseudo-Periodic Differences

Simulation  and  experimental  results  show  that 
reflection  coefficients  work  differently  with  periodic 
signals (sine wave) than with pseudo-periodic signals
(speech).

In calculating the reflection coefficients for a
sine wave, the samples of one frequency are changed very
slightly from the previous frequency's samples. Therefore
the calculated reflection coefficients also change very
slightly. This observation may be useful in the design of
an LPC musical synthesizer, where frequency content and
adjustment is processed in a controlled environment.

On the other hand, speech behavior is more random
than music. It is pseudo-periodic in the sense that
complex vibrations are necessary to produce the speech
waveform. However, the rate and randomness at which those
vibrations change frequencies seem to prevent the
reflection coefficients from having any kind of linear
relationship with frequency content.

It is therefore the conclusion of this research that
the relationship between frequency content of speech and
reflection coefficients is sufficiently complex that
modifying reflection coefficients in order to transpose
pitch will not be practical.

B.  RECOMMENDATIONS 

The conclusions have stated that there is no linear
relationship present between frequency content and
reflection coefficients. Recall that the motivation behind
this research was based on Hall's research [Ref. 3]
concerning pole shifting. Therefore the following actions
are recommended if further or more extensive study is
desired.

1.  Continue  Hall's  research  using  LPC  as  a  tool  for 
speech  analysis/synthesis,  but  focusing  attention  on 
the  shifting  of  poles  and  not  on  the  adjustment  of 
reflection  coefficients. 
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2.  Use  a  data  acquisition  system  that  yields  12  or  16 
bit  resolution  of  the  speech  samples. 

3.  Build a larger data base containing speech
utterances at different pitch levels and have the
speakers be both male and female.

4.  Have the ability to match articulation patterns and
synchronize points where speech utterances begin and
end.

5.  Synthesize the input and processed speech to check
for intelligibility of the utterances.

6.  Use more sophisticated techniques for pattern
recognition.

It  is  believed  that  the  preceding  recommendations,  if 
followed,  will  help  substantiate  or  refute  Hall's  research 
as  well  as  the  findings  of  this  research.  The  need  for  an 
adequate  technique  for  frequency  transposition  still  exists. 
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FREQUENCY  VARIED  SINEWAVE  PROGRAM 


This  program  determines  the  reflection  coefficients  for  a 
12th  order  lattice  filter  model  of  a  variable  frequency  sine 
[Plots for Appendixes B through E: reflection coefficients K versus
frequency (Hz), 10K samples/sec, noiseless and S:N = 10:1.]


APPENDIX  F  -  DATA  ACQUISITION  SYSTEM 


A.  INTRODUCTION 

There are a vast number of data acquisition systems on
the market today. Even so, the system originally planned
for the acquisition of this data broke down with no hope of
timely repair. When all possible
alternatives had been explored, it was decided that the only
way to accomplish this portion of the research was to build
a system capable of obtaining speech data samples.

This section will discuss the system, hardware, and
software utilities that were combined to produce the desired
data samples. In an effort to provide the novice, as well
as the expert, with the information needed to retrace these
steps, anything worth documenting has been documented.
Additionally, references for this appendix are provided in
the main Bibliography of this thesis.

B.  EQUIPMENT  REQUIREMENTS  AND  SETUP 

Figure 10 shows the experimental setup. Selected speech
utterances were recorded on a 4-channel, 8-track tape
recorder and stored for later use. The analog to digital
(A/D) circuit was built and driving software written. This
circuit was interfaced with the Zenith-100 microcomputer
through the Prolog 7804-Z80A Processor Counter/Timer Card
and the 8255 Parallel Peripheral Interface (PPI) microchip.

Data Acquisition 3-Dimensional Flow Chart

Figure  10. 

Once  the  data  was  captured  in  the  Prolog's  32K  buffer, 
it  was  uploaded  to  the  Zenith-100,  via  ZMDS  software,  and 
stored  in  Intel-Hex  data  files.  The  files  were  transferred 
from  the  Zenith  formatted  disk,  via  an  Osborne 
microcomputer,  and  placed  on  Kaypro  formatted  disks. 

A Kaypro 10 microcomputer converted the hexadecimal data
into decimal data using Microsoft Basic (MBASIC) software.
Edited versions of these files were then transferred to the
IBM-3033 mainframe computer for data processing.

C.  ANALOG  TO  DIGITAL  CIRCUIT 


The  chip  that  provides  the  analog  to  digital  conversion 
is  the  AD-570.  It  provides  8-bit  information  at  sampling 
rates  up  to  33K  samples/second.  For  our  purposes,  the 
sampling  rate  was  set  at  10K  since  the  majority  of  the 
frequency  content  is  below  5  kHz. 

The circuit diagram [App. F.2] illustrates the
interfacing  between  the  8255  PPI  chip  and  the  Host  computer. 
The  8255  coordinates  all  of  the  necessary  handshaking  in 
driving  the  AD-570  chip. 

It  was  necessary  to  amplify  the  signal  prior  to  entering 
the  AD-570,  to  obtain  full  use  of  the  256  amplitudes 
available.  It  was  also  necessary  to  provide  an  adjustable 
DC-offset to assure a unipolar input (i.e., the middle value
had to be adjusted to be level 128 instead of level 0).
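
The effect of the gain and DC-offset adjustments can be pictured as a
mapping from a bipolar signal onto the 256 available levels. The
Python sketch below is purely illustrative and does not model the
AD-570 itself: the function name, the normalized full-scale value, and
the clipping behavior are all assumptions.

```python
def to_adc_code(value, full_scale=1.0):
    """Map a bipolar value in [-full_scale, +full_scale] to a code 0..255.

    Zero input lands on level 128; inputs beyond full scale are clipped.
    """
    scaled = 128 + 128 * value / full_scale
    return max(0, min(255, int(round(scaled))))


# A quiet signal that swings only +/- 0.25 of full scale.
quiet_codes = [to_adc_code(v) for v in (-0.25, 0.0, 0.25)]
```

A signal that swings over only a fraction of full scale occupies a
correspondingly small part of the 256 levels, which is why
amplification ahead of the converter was needed.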

Also,  the  signal  was  filtered  prior  to  data  acquisition, 
through  the  use  of  a  Butterworth  filter  designed  with  a 
frequency  cutoff  of  5  kHz.  This  helps  smooth  the  data. 
However,  during  the  processing  of  the  data  it  may  be 
necessary  to  filter  it  again.  These  additional  circuits  are 
also  provided  as  Appendix  F.2. 

D.  MICROCOMPUTER  INTERFACE 

The flow chart, provided as Appendix F.3, illustrates
the Z-80 assembly language program, Appendix F.4, that was
needed to drive the A/D circuit and collect the speech data.
The program, A2D.ASM, was also useful in testing, step by
step, the proper operation of the circuit.

The Z-80A microprocessor is at the heart of the system,
and the software designed to drive it is assembled using
the Macro Assembler (M80) and linked to the Prolog station
using Link software (L80). For more information on these
procedures refer to the Bibliography.

1.  Sampling Rate

The sampling rate is not arbitrary. It is a
function of the software. In assembly language programming
each step that the microprocessor goes through takes a
specific amount of time. We will refer to the unit of time
as a T-state. Each T-state equals the inverse of the clock
rate interfaced with the Z-80 chip. Since we are using a 4
MHz clock, one T-state equals 250 nanoseconds.

Every  command  line  in  the  assembly  program, 
including  the  command  'No  Operation'  or  NOP,  requires 
several  T-states  to  accomplish  its  task.  We  are  interested 
in  the  interval  of  time  it  takes  from  one  sample  to  the 
next,  and  then  we  modify  the  software  accordingly. 

This program has a delay loop in it (labeled DELAY)
to slow down the data acquisition to 10K samples/second. If
it did not have the delay loop in it, it would easily sample
at 23K samples/second. Since each utterance was limited to
less than one second, 10K samples/second is workable and does
not present prohibitive record lengths.
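
The timing budget implied by these figures can be worked out directly.
The Python sketch below only restates the arithmetic; it is not
derived from the listing of A2D.ASM, the names are hypothetical, and
the 23K samples/second figure is treated as approximate.

```python
CLOCK_HZ = 4_000_000        # Z-80 clock rate stated in the text


def t_state_ns(clock_hz=CLOCK_HZ):
    """Duration of one T-state in nanoseconds."""
    return 1e9 / clock_hz


def t_states_per_sample(sample_rate_hz, clock_hz=CLOCK_HZ):
    """Number of T-states available between consecutive samples."""
    return clock_hz / sample_rate_hz


budget_10k = t_states_per_sample(10_000)           # 400 T-states
loop_without_delay = t_states_per_sample(23_000)   # roughly 174 T-states
delay_t_states = budget_10k - loop_without_delay   # roughly 226 T-states
```

On these assumptions the DELAY loop must consume on the order of 226
T-states per sample to bring the rate down to 10K samples/second.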

E.  DATA  FILE  SETUP  AND  MANIPULATION 

Once  the  data  is  collected  and  stored  in  the  Prolog's 
32K  buffer,  it  is  uploaded  onto  a  Zenith  100  formatted 
floppy  disk  and  stored  in  an  appropriately  titled  HEX  file. 
A  sample  of  a  typical  segment  of  data  is  provided  as 
Figure  11. 

[Seven Intel-Hex records of sampled speech data; the individual
hexadecimal digits are not legible in this copy.]

Hexadecimal  Data  File  Segment 
Figure  11. 

The file is in Intel-Hex format. A colon starts
each line. The following '10' (hexadecimal) tells us that
the line holds a full 16 bytes of data. The next four digits
indicate the memory location in the buffer. These are
followed by a double 0, the record type for data. Every two
hexadecimal digits after that represent one byte of
information.

Following the double 0, there are 16 bytes of data, and
then a checksum byte at the very end. For our purposes the
first nine characters and the last two digits are of no use.
The Intel-Hex file is already in ASCII format.
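
The record layout just described can be illustrated with a short
Python sketch. This is not the MBASIC conversion program of Appendix
F.5; the function name is hypothetical, and the sample record is a
generic Intel-Hex data record chosen for illustration rather than a
line taken from the speech data files.

```python
def parse_intel_hex_record(line):
    """Return (address, record_type, data_bytes) for one Intel-Hex line."""
    line = line.strip()
    if not line.startswith(":"):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(line[1:])
    count = raw[0]                          # number of data bytes
    if len(raw) != count + 5:
        raise ValueError("record length does not match byte count")
    if sum(raw) % 256 != 0:                 # checksum makes the sum zero
        raise ValueError("checksum mismatch")
    address = (raw[1] << 8) | raw[2]        # four-digit buffer location
    record_type = raw[3]                    # 00 = data, 01 = end of file
    return address, record_type, list(raw[4:4 + count])


# Illustrative record only (not taken from the thesis data).
sample = ":10010000214601360121470136007EFE09D2190140"
address, record_type, data = parse_intel_hex_record(sample)
```

The list `data` then holds the 16 sample values as integers in the
range 0 to 255, which is the integer form required for further
processing.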

An Osborne microcomputer was used to transfer data from
the Zenith 100 formatted floppy disk to a Kaypro formatted
floppy disk. Since the data is needed in integer form to do
the necessary processing, a program was written [App. F.5],
in Microsoft Basic (MBASIC), to convert the data
files from hexadecimal into the equivalent integer values.

Finally, the data is ready for processing. Since the
software was already written on the IBM-3033 to process and
display the data, it was sent there via a 1200 baud modem
and processed.


APPENDIX F.2


CIRCUIT DIAGRAMS FOR THE DATA ACQUISITION SYSTEM

APPENDIX F.5 - INTELHEX TO DECIMAL DATA CONVERSION PROGRAM


This  program  is  designed  to  read  a  data  file  that  is  in 
Intelhex  format  and  convert  it  to  an  integer  file. 


2 PRINT "INPUT FILE ";:INPUT FI$
4 PRINT "OUTPUT FILE ";:INPUT FO$
20 OPEN "O",2,FO$
30 OPEN "I",1,FI$
40 INPUT #1,IN$
60 IF MID$(IN$,2,1)="0" THEN CLOSE:GOTO 140
70 FOR I=10 TO 40 STEP 2
80 HX$=MID$(IN$,I,2)
90 V%=VAL("&H"+HX$)
95 IF I=40 THEN PRINT #2,USING "###";V% ELSE PRINT #2,USING "###";V%;
98 IF I=40 THEN PRINT USING "###";V% ELSE PRINT USING "###";V%;
100 NEXT I
125 PRINT
130 GOTO 40
140 PRINT "DONE"+CHR$(7)
APPENDIX K - SPEECH REFLECTION COEFFICIENT PROGRAM

This program is designed to determine the reflection
coefficients for a 12th order lattice filter model of
speech. It yields a new set of K's every 10 ms.
Figure L.1. Sneeze-E Pattern of Reflection Coefficient K1
[Plot axes: reflection coefficient magnitude versus time (x 10 msec)]

Figure L.4. Sneeze-E Pattern of Reflection Coefficient K4
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APPENDIX  M 


TREND  ANALYSIS  RESULTS 


The  following  lists  are  the  observations  made  on  the 
reflection coefficient curves for each utterance.


"READY" 

K1 - All pitches have relatively flat curves. The
magnitudes vary slightly between +.8 and +1.0. The higher
the pitch, the more defined the troughs are.

K2 - These curves all had the unique feature of sloping
upward. They generally ranged from -.4 to +.9. No other
correlation was noted.

K3 - A negative sloping tendency characterized this set of
curves.

K4 - Each of these curves had a plateau. Ready-B, however,
did not fit in with this set at all.

K5 - These curves seemed to stay within a similar range, .3
to -.7. Also several prominent peaks were uncorrelated.

K6 - No correlations were noted; however, Ready-B was
drastically different.


"SNEEZE" 

K1 - Relatively flat curves. Ranges from .8 to 1.0.

K2  -  Highly  uncorrelated  curves. 

K3  -  Also  highly  uncorrelated  curves,  however,  more  flat 
than  K2. 

K4 - MC and D are similarly flat; the rest seem correlated
with a valley to an elevated flat plateau.

K5  -  There  seems  to  be  a  peak,  then  a  declining  trend  in 
most  of  these  curves.  Again  MC  and  D  don't  fit  this 
observation  and  are  generally  flat. 

K6  -  There  are  several  peaks,  then  relatively  flat  curves. 
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"SO  WHAT" 


K1  -  Similarly  flat  patterns. 

K2  -  Highly  uncorrelated  with  no  recognizable  patterns. 

K3  -  There  is  a  prominent  valley  in  all  of  the  observations 
except A.

K4  -  Highly  uncorrelated  with  no  recognizable  patterns. 

K5  -  Highly  uncorrelated  with  no  recognizable  patterns. 

K6  -  Highly  uncorrelated  with  no  recognizable  patterns. 
APPENDIX O - FREQUENCY CONTENT OF K(N)

This is an example of the output from the FFT program to
determine if there are any discrete frequencies present in

Figure O.1. Reflection Coefficient K6 for Utterance 'Ready-MC'.

Figure O.2. Reflection Coefficient K3 for Utterance 'Ready-MC'.